2025-05-12-12-04
Understanding Stragglers in Large Model Training Using What-if Analysis
Abstract
arXiv:2505.05713v1. Large language model (LLM) training is one of the most demanding distributed computations today, often requiring thousands of GPUs with frequent synchronization across machines. This workload pattern makes it susceptible to stragglers, where a few slow workers can stall training. At ByteDance, we find that stragglers are not always simply caused by hardware failures, but can arise from multiple complex factors. This work presents a comprehensive study of straggler issues in LLM training, using a five-month trace collected from our ByteDance LLM training cluster. The core methodology is what-if analysis, which simulates the scenario without any stragglers and contrasts it with the actual case. We use this method to study the following questions: (1) how often do stragglers affect training jobs, and what effect do they have on job performance; (2) do stragglers exhibit temporal or spatial patterns; and (3) what are the potential root causes of stragglers?
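Summary
Training large language models (LLMs) is among the most challenging distributed computing tasks today, typically requiring thousands of GPUs and frequent cross-machine synchronization. This workload pattern leaves training vulnerable to stragglers: a few slow workers can stall the entire run. ByteDance finds that stragglers are not always simply caused by hardware failures and can instead stem from multiple complex factors. Based on five months of trace data from a ByteDance LLM training cluster, this study systematically analyzes the straggler problem in LLM training. The core method is what-if analysis, which simulates an ideal scenario without stragglers and compares it with what actually happened. The method is used to investigate: (1) how often stragglers affect training jobs and how much they degrade job performance; (2) whether stragglers show temporal or spatial regularities; and (3) what the potential root causes of stragglers are.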
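The what-if idea in this abstract can be illustrated with a toy calculation. This is a minimal sketch under simplifying assumptions, not the paper's simulator: it assumes fully synchronous steps (each step lasts as long as its slowest worker) and treats the per-step median worker time as the straggler-free ideal. The function names and the trace values are hypothetical.

```python
from statistics import median


def step_time(worker_times):
    """A synchronous step finishes only when its slowest worker does."""
    return max(worker_times)


def what_if_slowdown(trace):
    """Ratio of actual job time to an idealized straggler-free job time.

    trace: list of steps, each a list of per-worker durations in seconds.
    Assumption (not from the paper): without stragglers, every worker
    would run at the median speed observed in that step.
    """
    actual = sum(step_time(step) for step in trace)
    ideal = sum(median(step) for step in trace)
    return actual / ideal


# Hypothetical trace: 3 steps, 4 workers, one slow worker in step 2.
trace = [
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 2.0],
    [1.0, 1.0, 1.0, 1.0],
]
slowdown = what_if_slowdown(trace)
print(f"slowdown = {slowdown:.2f}x")  # prints: slowdown = 1.33x
```

A single worker running twice as slow in one of three steps costs the whole job a third more time, which is why a few slow workers matter so much under frequent synchronization.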
An Automated LLM-based Pipeline for Asset-Level Database Creation to Assess Deforestation Impact
Abstract
arXiv:2505.05494v1. The European Union Deforestation Regulation (EUDR) requires companies to prove their products do not contribute to deforestation, creating a critical demand for precise, asset-level environmental impact data. Current databases lack the necessary detail, relying heavily on broad financial metrics and manual data collection, which limits regulatory compliance and accurate environmental modeling. This study presents an automated, end-to-end data extraction pipeline that uses LLMs to create, clean, and validate structured databases, specifically targeting sectors with a high risk of deforestation. The pipeline introduces Instructional, Role-Based, Zero-Shot Chain-of-Thought (IRZ-CoT) prompting to enhance data extraction accuracy and a Retrieval-Augmented Validation (RAV) process that integrates real-time web searches for improved data reliability. Applied to SEC EDGAR filings in the Mining, Oil & Gas, and Utilities sectors, the pipeline demonstrates significant improvements over traditional zero-shot prompting approaches, particularly in extraction accuracy and validation coverage. This work advances NLP-driven automation for regulatory compliance, CSR (Corporate Social Responsibility), and ESG, with broad sectoral applicability.
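Summary
The European Union Deforestation Regulation (EUDR) requires companies to prove that their products have not caused deforestation, creating an urgent need for precise asset-level environmental impact data. Existing databases lack the necessary detail because they rely too heavily on broad financial metrics and manual data collection, which constrains regulatory compliance and the accuracy of environmental modeling. This study proposes an automated end-to-end data extraction pipeline that uses large language models (LLMs) to build, clean, and validate structured databases, targeting industries with high deforestation risk. The pipeline adopts an Instructional, Role-Based, Zero-Shot Chain-of-Thought (IRZ-CoT) prompting strategy to improve extraction accuracy, and introduces a Retrieval-Augmented Validation (RAV) mechanism that improves data reliability through real-time web searches. Applied to SEC EDGAR filings from the mining, oil and gas, and utilities sectors, the pipeline shows significant improvements over traditional zero-shot prompting in both extraction accuracy and validation coverage. The work advances NLP-based automation for regulatory compliance, corporate social responsibility (CSR), and environmental, social, and governance (ESG) reporting, with broad applicability across industries.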
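The abstract names the three ingredients of IRZ-CoT prompting (an instruction, a role, and a zero-shot chain-of-thought cue) but does not give the prompt text. The sketch below only shows how such a prompt might be assembled. The function name, the role and instruction wording, and the field list are all hypothetical and are not taken from the paper; the cue "Let's think step by step." is the commonly used zero-shot chain-of-thought trigger.

```python
def build_irz_cot_prompt(filing_excerpt, fields):
    """Assemble a role + instruction + zero-shot CoT prompt (illustrative only)."""
    # Role: hypothetical wording, not the paper's prompt.
    role = "You are an analyst who extracts physical asset data from company filings."
    # Instruction: hypothetical wording and output format.
    instruction = (
        "Extract every asset mentioned in the excerpt and report these fields: "
        + ", ".join(fields)
        + ". Return one JSON object per asset."
    )
    # Zero-shot chain-of-thought cue: no worked examples are supplied.
    cot_cue = "Let's think step by step."
    return "\n\n".join([role, instruction, "Excerpt:\n" + filing_excerpt, cot_cue])


prompt = build_irz_cot_prompt(
    "The company operates an open-pit mine in ...",  # placeholder excerpt
    ["asset_name", "asset_type", "location"],        # hypothetical fields
)
print(prompt)
```

The returned string would be sent to whichever LLM client the pipeline uses; that call is left out because the abstract does not specify it.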
HiBayES: A Hierarchical Bayesian Modeling Framework for AI Evaluation Statistics
Abstract
arXiv:2505.05602v1. As Large Language Models (LLMs) and other AI systems evolve, robustly estimating their capabilities from inherently stochastic outputs while systematically quantifying uncertainty in these estimates becomes increasingly important. Further, advanced AI evaluations often have a nested hierarchical structure, exhibit high levels of complexity, and come with high costs in testing the most advanced AI systems. To address these challenges, we introduce HiBayES, a generalizable Hierarchical Bayesian modeling framework for AI Evaluation Statistics. HiBayES supports robust inferences in classical question-answer benchmarks and advanced agentic evaluations, particularly in low-data scenarios (e.g., < 20 data points per evaluation). Built on Generalized Linear Models (GLMs), Bayesian data analysis, and formal model comparison, HiBayES provides principled uncertainty quantification and robust parameter estimation. This paper offers a comprehensive introduction to HiBayES, including illustrative examples, comparisons to conventional statistical methods, and practical guidance for implementing multilevel Bayesian GLMs. Additionally, we provide a HiBayES software package [4] (Beta version) for out-of-the-box implementation.
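Summary
As large language models (LLMs) and other AI systems develop rapidly, it is increasingly important to estimate their capabilities robustly from inherently stochastic outputs and to quantify the uncertainty in those estimates systematically. Advanced AI evaluations also tend to have a nested hierarchical structure, are highly complex, and are costly to run on the most advanced systems. To address these challenges, the authors propose HiBayES, a generalizable hierarchical Bayesian modeling framework designed for AI evaluation statistics. HiBayES supports robust inference in classical question-answer benchmarks and advanced agentic evaluations, and is especially suited to low-data scenarios (for example, fewer than 20 data points per evaluation). Built on generalized linear models (GLMs), Bayesian data analysis, and formal model comparison, the framework provides principled uncertainty quantification and robust parameter estimation. The paper gives a comprehensive introduction to HiBayES, including illustrative examples, comparisons with conventional statistical methods, and practical guidance for implementing multilevel Bayesian GLMs. The authors also provide a HiBayES software package [4] (Beta version) that works out of the box.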
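To give a feel for why hierarchical modeling helps in low-data evaluations, here is a deliberately simplified sketch of partial pooling. It does not use the HiBayES package and is not the paper's model. It assumes a Binomial likelihood per task with a shared Beta prior whose mean is the pooled success rate and whose strength, `prior_strength`, is a hand-picked hypothetical value that a full hierarchical model would infer from the data. The task counts are made up.

```python
def partial_pool(successes, trials, prior_strength=10.0):
    """Posterior-mean success rate per task under a shared Beta prior.

    Assumption: the prior is Beta(a0, b0) with mean equal to the pooled
    success rate and a0 + b0 = prior_strength (fixed by hand here).
    """
    pooled = sum(successes) / sum(trials)
    a0 = pooled * prior_strength
    b0 = (1.0 - pooled) * prior_strength
    # Beta-Binomial conjugacy: posterior mean = (a0 + k) / (a0 + b0 + n).
    return [(a0 + k) / (a0 + b0 + n) for k, n in zip(successes, trials)]


# Hypothetical results: three tasks, 10 trials each.
successes = [9, 2, 5]
trials = [10, 10, 10]

raw = [k / n for k, n in zip(successes, trials)]
pooled_estimates = partial_pool(successes, trials)

for r, e in zip(raw, pooled_estimates):
    print(f"raw = {r:.2f}  partially pooled = {e:.2f}")
```

Each estimate is pulled from its raw rate toward the overall rate, and the pull is stronger when a task has fewer trials. That shrinkage is the basic behavior that makes multilevel models more stable than per-task averages when data are scarce.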